Python for Clinical Study Reports and Submission

R/Pharma 2025 Workshop

Yilong Zhang, Nan Xiao

2025-11-07

Welcome

Outline

Four parts of this workshop:

  1. Python environment setup (Nan)
    Use uv to create and manage reproducible Python projects. Develop and collaborate in GitHub Codespaces, Visual Studio Code, or Positron.

  2. Python packages for clinical reporting (Yilong)
    A guided tour of essential packages such as polars, plotnine, and rtflite, with demonstrations of creating TLFs commonly used in clinical trials.

  3. Manage clinical trial A&R projects (Yilong)
    Practical project structure, conventions, and execution from data to deliverables.

  4. Prepare eCTD submission packages (Nan)
    An example workflow for assembling submission-ready source code and outputs using py-pkglite, aligned with eCTD requirements.

Disclaimer

The views and opinions expressed in this presentation are those of the individual presenters and do not represent those of their affiliated organizations or institutions.

Training objective

Using Python, you will learn how to:

  • Create tables for clinical study reports
  • Organize clinical development projects effectively
  • Prepare eCTD submission packages to regulatory agencies

Note

The toolchain, process, and formats may differ across organizations. We present one common way to address them.

Note

Interested in R? Check out https://r4csr.org/

Acknowledgements

  • R/Pharma organizers

    • It is a fun and productive annual gathering
    • Please consider sharing stories and use cases to expand the community
  • Team members from Meta Platforms and Merck & Co., Inc., Rahway, NJ, USA

  • Contributors of pycsr and r4csr training materials

    • Please consider submitting issues or pull requests to the repositories

Preparation

In this workshop, we assume you have basic Python programming experience and clinical development knowledge.

  • Python packages: polars (data manipulation), plotnine (graphics), rtflite (RTF output)
  • ADaM data: adsl, adae, etc.

Resource

  • Training material: https://pycsr.org/

  • During the workshop, we will use the pycsr project

    • Project link will be shared in chat
    • Post questions in group chat

Philosophy

We share the same automation philosophy as the R community, described in Section 1.1 of the R Packages book and quoted here:

  • “Anything that can be automated, should be automated.”
  • “Do as little as possible by hand. Do as much as possible with functions.”
  • “The goal is to spend your time thinking about what you want to do rather than thinking about the minutiae of package structure.”

Python environment setup

Development environments

Three recommended options:

GitHub Codespaces

  • Cloud-based, pre-configured
  • No local setup needed
  • 120 free core hours/month

Positron

  • Posit’s next-gen IDE
  • Native notebook support
  • Built-in data viewer

VS Code

  • Most popular choice
  • Rich extension ecosystem
  • Essential extensions: Python, Pylance, Ruff, Quarto

Why uv?

uv is a modern Python package and project manager written in Rust.

Replaces a scattered toolchain:

  • pip + venv + pyenv + pip-tools + setuptools

Benefits:

  • Fast: 10-100x faster than pip
  • Complete: Manages Python versions, dependencies, builds
  • Modern: Uses pyproject.toml as single source of truth
  • Reliable: Automatic dependency resolution and lock files

Installing uv

Skip if using GitHub Codespaces: uv is pre-installed there.

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Verify:

uv --version

Quick start with uv

# Create new project
uv init pycsr-example
cd pycsr-example

# Pin Python version
uv python pin 3.13.9

# Add dependencies
uv add polars plotnine rtflite

# Add dev dependencies
uv add --dev ruff pytest mypy

# Sync environment
uv sync

Python toolchain essentials

Ruff - Code formatting and linting

uv run ruff format .
uv run ruff check .

mypy - Type checking

uv run mypy src/

pytest - Testing framework

uv run pytest tests/

All configured in pyproject.toml.

Key concepts

Virtual environments are mandatory in Python

  • Isolate project dependencies
  • Prevent conflicts
  • Enable reproducibility

Dependency locking

  • uv.lock pins exact versions
  • Ensures reproducible environments
  • Similar to R’s renv.lock

.python-version file

  • Specifies exact Python version (e.g., 3.13.9)
  • Critical for regulatory submissions

Delivering TLFs in CSR

ICH E3 guidance

The ICH E3 guideline, Structure and Content of Clinical Study Reports, provides guidance to assist sponsors in the development of a CSR.

In a CSR, most TLFs are located in:

  • Section 10: Study patients
  • Section 11: Efficacy evaluation
  • Section 12: Safety evaluation
  • Section 14: Tables, Figures and Graphs referred to but not included in the text
  • Section 16: Appendices

Datasets

Tools

  • polars: Python package for data manipulation, similar to the dplyr/tidyr R packages

  • rtflite: Python package for creating production-ready tables and figures in RTF format, similar to the r2rtf R package

polars intro

Why polars?

Modern Python dataframe library designed for performance and expressiveness.

Key advantages:

  • Fast: Written in Rust with parallel execution
  • Memory efficient: Lazy evaluation and streaming support
  • Type-safe: Strong type system prevents common errors
  • Modern API: Method chaining with clear, readable syntax

Core operations:

df.filter(pl.col("AGE") > 65)
df.group_by("TRT01P").agg(n=pl.len())
df.pivot(index="row", on="TRT01PN", values="n")

Essential patterns for TLFs

Counting participants:

df.group_by("TRT01P").agg(n=pl.len())

Calculating percentages:

.join(totals, on="TRT01P")
.with_columns(
    pct=(100.0 * pl.col("n") / pl.col("total")).round(1)
)

Pivoting to wide format:

.pivot(index="category", on="TRT01P", values="n")

Handling missing data

Fill nulls for categorical counts:

.with_columns(
    pl.col(["n_0", "n_54", "n_81"]).fill_null(0)
)

Use typed literals for schema consistency:

pl.lit(None, dtype=pl.Float64).alias("pct")

Count unique subjects (not events):

pl.col("USUBJID").n_unique()

rtflite intro

Motivation

In the pharmaceutical industry, RTF and Microsoft Word play a central role in preparing clinical study reports.

Different organizations can have different table standards:

  • For example, table layout, font size, border type, footnotes, data sources

rtflite is a Python package to create production-ready tables and figures in RTF format.

rtflite is designed to:

  • Provide simple Python classes that map to table elements (title, headers, body, footnotes) for intuitive table construction.
  • Offer a canonical Python API with a clear, composable interface.
  • Focus exclusively on table formatting and layout, leaving data manipulation to dataframe libraries like polars or pandas.
  • Minimize external dependencies for maximum portability and reliability.

Workflow

Before creating an RTF table, we need to:

  • Figure out table layout.

  • Split the layout into small tasks in the form of a computer program.

  • Execute the program.

Basic workflow

Three-step process:

  1. Prepare data: Use polars to calculate statistics
  2. Define structure: Create rtflite components (title, headers, body, footnotes)
  3. Generate RTF: Write output file for submission

doc = rtf.RTFDocument(
    df=your_dataframe,
    rtf_title=rtf.RTFTitle(text=["Your Table Title"]),
    rtf_column_header=rtf.RTFColumnHeader(...),
    rtf_body=rtf.RTFBody(...),
    rtf_source=rtf.RTFSource(text=["Source: ADSL"])
)

doc.write_rtf("output.rtf")

Package overview

The rtflite package provides the flexibility to customize table appearance:

  • Table component: title, column header, footnote, etc.
  • Table cell style: size, border type, color, font size, text color, alignment, etc.
  • Flexible control: the specification of the cell style can be row or column vectorized.
  • Complicated format: pagination, section grouping, multiple table concatenations, etc.

The rtflite package also provides the flexibility to include figures in RTF format.

Simple example: adverse events

rtflite focuses only on table formatting. Data manipulation and analysis should be handled by other Python packages.

Key rtflite components

RTFTitle: Main and subtitle lines

rtf.RTFTitle(text=["Main Title", "Subtitle"])

RTFColumnHeader: Define column structure

rtf.RTFColumnHeader(
    text=["", "Placebo", "Low Dose", "High Dose"],
    col_rel_width=[3, 2, 2, 2],
    text_justification=["l", "c", "c", "c"]
)

RTFBody: Table content formatting

rtf.RTFBody(col_rel_width=[3, 2, 2, 2])

Multi-level headers

Create hierarchical column headers:

rtf_column_header=[
    rtf.RTFColumnHeader(
        text=["", "Placebo", "Xanomeline Low Dose", "Xanomeline High Dose"],
        col_rel_width=[3] + [2] * 3
    ),
    rtf.RTFColumnHeader(
        text=["", "n", "(%)", "n", "(%)", "n", "(%)"],
        col_rel_width=[3] + [1] * 6,
        border_top=[""] + ["single"] * 6,
        border_left=["single"] + ["single", ""] * 3
    )
]

Advanced features

Multiple tables in one document:

doc = rtf.RTFDocument(
    df=[table1, table2],  # List of DataFrames
    rtf_column_header=[header1, header2],
    rtf_body=[body1, body2]
)

Conditional formatting:

rtf.RTFBody(
    text_font_style=lambda df, i, j:
        "b" if not df[i, j].startswith("  ") else ""
)
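The lambda above receives the dataframe and cell indices, but the styling rule itself is plain Python and can be checked in isolation. A minimal sketch of that rule, using made-up row labels:

```python
# Bold top-level rows; leave indented subcategory rows unstyled.
# Mirrors the lambda's rule: two leading spaces mark a subcategory.
def font_style(cell: str) -> str:
    return "b" if not cell.startswith("  ") else ""

rows = ["Any adverse event", "  Headache", "  Nausea"]
styles = [font_style(r) for r in rows]
print(styles)  # ['b', '', '']
```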

Typical TLF workflow

Seven-step pattern seen across all examples:

  1. Load data: Read Parquet datasets (ADSL, ADAE, ADLBC)
  2. Filter population: Apply analysis population flags
  3. Calculate statistics: Group, aggregate, join
  4. Format values: Create display strings (Mean (SD), n (%))
  5. Reshape data: Pivot to wide format if needed
  6. Combine sections: Stack multiple result tables
  7. Generate RTF: Create publication-ready output

Focus on one step at a time: break complex tables into manageable pieces.
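Step 4 (format values) typically reduces to small f-string helpers. A sketch with hypothetical helper names, not a fixed convention:

```python
def fmt_n_pct(n: int, pct: float) -> str:
    """Format a count with percentage, e.g. '12 (34.5)'."""
    return f"{n} ({pct:.1f})"

def fmt_mean_sd(mean: float, sd: float) -> str:
    """Format mean with standard deviation, e.g. '75.2 (8.59)'."""
    return f"{mean:.1f} ({sd:.2f})"

print(fmt_n_pct(12, 34.54))      # 12 (34.5)
print(fmt_mean_sd(75.21, 8.59))  # 75.2 (8.59)
```

Centralizing display formats in helpers like these keeps rounding rules consistent across all tables in a deliverable.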

Common patterns

Reusable helper functions

Create functions for repeated operations:

def count_by_treatment(data, population_name):
    """Count participants by treatment group"""
    return data.group_by("TRT01P").agg(
        n=pl.len()
    ).with_columns(
        population=pl.lit(population_name)
    )

Benefits:

  • Reduce code duplication
  • Easier to maintain and test
  • Clear, self-documenting code

Handling hierarchical structures

Use indentation for subcategories:

pl.concat_str([pl.lit("    "), pl.col("DCREASCD")])

Build tables row by row when needed:

table_rows = []
table_rows.append(["Section Header", "", "", ""])
table_rows.append([f"  {category}", value1, value2, value3])

Maintain sort order with Enum:

.sort(pl.col("category").cast(pl.Enum(category_order)))

Statistical analysis integration

Use pandas bridge for statsmodels:

# Convert for analysis
ancova_df = gluc_data.to_pandas()

# Fit model
model = smf.ols("CHG ~ TRTP + BASE", data=ancova_df).fit()

# Convert back to polars for formatting
results_df = pl.DataFrame(results_dict)

Key libraries:

  • statsmodels: Linear models, ANCOVA
  • scipy.stats: Statistical tests, distributions
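The pandas bridge above can be made self-contained with simulated data. The effect sizes and variable values below are made up; only the formula matches the workshop's ANCOVA pattern:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 60

# Simulated baseline values, treatment assignment, and change from baseline
base = rng.normal(100, 10, n)
trtp = np.repeat(["Placebo", "Active"], n // 2)
chg = -2.0 * (trtp == "Active") + 0.3 * (base - 100) + rng.normal(0, 3, n)

ancova_df = pd.DataFrame({"CHG": chg, "TRTP": trtp, "BASE": base})

# ANCOVA: change from baseline adjusted for baseline value
model = smf.ols("CHG ~ TRTP + BASE", data=ancova_df).fit()
print(model.params)
```

The fitted coefficients can then be collected into a dictionary and converted back to polars for table formatting, as in the snippet above.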

Best practices & pitfalls

Do:

  • Use .n_unique() to count unique subjects (not .len() on event data)
  • Fill nulls with typed literals: pl.lit(None, dtype=pl.Float64)
  • Sort pivoted results to ensure consistent column order
  • Use left joins with treatment levels to preserve all groups
  • Test with the safety population (SAFFL == "Y") for AE tables

Avoid:

  • Counting events when you need subject counts
  • Missing categories in pivoted data (use .fill_null(0))
  • Inconsistent data types causing schema errors
  • Forgetting to filter to analysis population
  • Hard-coded column names (treatment groups may vary)

Break and/or exercise (5 min)

CSR examples

Disposition table

Key concepts:

  • Track participant flow (enrolled -> completed/discontinued)
  • Use .pivot() to reshape data to wide format
  • Handle missing categories with .fill_null(0)
  • Multi-level headers with borders

Link: https://pycsr.org/tlf-disposition.html

Analysis population

Key concepts:

  • Document multiple analysis populations (ITT, efficacy, safety)
  • Population flags: ITTFL, EFFFL, SAFFL
  • Reusable helper functions
  • Conditional formatting: N for totals, N (%) for subsets

Link: https://pycsr.org/tlf-population.html

Baseline characteristics

Key concepts:

  • Separate functions for continuous vs categorical
  • Continuous: Mean (SD), Median [Min, Max]
  • Categorical: n (%)
  • Build tables with proper indentation

Link: https://pycsr.org/tlf-baseline.html

Efficacy table

Key concepts:

  • LOCF imputation for missing data
  • ANCOVA with statsmodels
  • LS means at baseline mean
  • Multiple table sections in one document
  • Comprehensive footnotes

Link: https://pycsr.org/tlf-efficacy-ancova.html

AE Summary table

Key concepts:

  • Count unique participants with .n_unique()
  • Standard AE categories (any, drug-related, serious, deaths)
  • Join with population totals for percentages
  • Multi-level column headers

Link: https://pycsr.org/tlf-ae-summary.html

Specific AE table

Key concepts:

  • Hierarchical structure: SOC -> Preferred Terms
  • Standardize terms with .str.to_titlecase()
  • Conditional formatting with lambda functions
  • Bold headers for top-level categories

Link: https://pycsr.org/tlf-ae-specific.html

Break (5 min)

Analysis package

What is an analysis package?

A Python package designed specifically to organize analysis scripts and code for a clinical trial project.

Purpose:

  • Project containers for clinical trial deliverables
  • Reproducible environments for analyses
  • Submission-ready structures for regulatory review

Combines:

  • Python package structure (code organization)
  • Quarto project (report generation)
  • Regulatory requirements (eCTD submission)

Package structure

demo-py-esub/
├── pyproject.toml          # Project metadata
├── .python-version         # Python version
├── uv.lock                 # Locked dependencies
├── src/demo001/            # Study-specific code
│   ├── __init__.py
│   └── utils.py
├── analysis/               # Quarto analysis docs
│   └── tlf-*.qmd
├── data/                   # ADaM datasets
├── output/                 # Generated TLFs
└── tests/                  # Validation tests

See: https://pycsr.org/pkg-structure.html

Benefits

Consistency

  • Standard structure across projects
  • Team knows where files belong

Reproducibility

  • uv.lock pins dependencies
  • .python-version specifies Python

Automation

  • uv sync restores environment
  • quarto render generates outputs
  • pytest validates code

Compliance

  • Built-in documentation
  • Testing infrastructure
  • Standard structure

Git-centric workflow

Core principle: All project assets in version control.

Plain text workflow:

  • .qmd files for analysis (not .ipynb for final deliverables)
  • .md files for documentation
  • .toml files for configuration
  • Avoid .xlsx files for tracking

Project tracking:

  • Issues for requirements
  • Pull requests for review
  • Project boards (Kanban)

See: https://pycsr.org/pkg-management.html

Development lifecycle

Planning:

  • Define TLFs from SAP
  • Create mock tables
  • Assign validation levels
  • Lock Python version and package repo

Development:

  • Create feature branches
  • Implement in analysis/ and src/
  • Self-test against mocks
  • Open pull requests

Validation:

  • Independent review
  • Write unit tests in tests/
  • Run automated checks (ruff, mypy, pytest)

Delivery:

  • Generate all outputs with quarto render
  • Prepare submission package

Break (5 min)

eCTD submission

FDA requirements

FDA Study Data Technical Conformance Guide Section 4.1.2.10:

Submit programs for primary and secondary efficacy analyses. Specify software in ADRG. Use ASCII text format. No executable extensions.

Goal: Enable reviewers to understand and confirm analysis algorithms.

See: https://pycsr.org/submission-overview.html

Demo repositories

Analysis package: https://github.com/elong0527/demo-py-esub

Submission package: https://github.com/elong0527/demo-py-ectd

Clone and explore to see complete examples.

eCTD Module 5 structure

m5/datasets/<study-id>/analysis/adam/
├── datasets/
│   ├── *.xpt               # ADaM datasets
│   ├── define.xml
│   ├── adrg.pdf            # Instructions
│   └── analysis-results-metadata.pdf
└── programs/
    ├── py0pkgs.txt         # Packed Python package
    ├── tlf-01-*.txt        # Analysis programs
    └── tlf-02-*.txt

Key: All files in programs/ must be ASCII text.

The solution: pkglite for Python

Packs Python projects into portable text files.

Why needed:

  • Python packages have directory structure
  • May contain binary files
  • FDA requires ASCII text format

pkglite capabilities:

  • Pack entire project into single .txt file
  • Preserve file paths and metadata
  • Unpack to restore original structure
  • Support multiple packages in one file

Documentation: https://pharmaverse.github.io/py-pkglite/

Packing workflow

1. Create .pkgliteignore

uvx pkglite use demo-py-esub/

2. Pack the package

uvx pkglite pack demo-py-esub/ \
  -o programs/py0pkgs.txt

3. Convert Quarto to Python scripts

  • Render .qmd -> verify it works
  • Convert .qmd -> .ipynb -> .py
  • Clean and format with ruff
  • Save as .txt (no .py extension)

See: https://pycsr.org/submission-package.html

Packed file format

Human-readable Debian Control File (DCF) format:

# Generated by py-pkglite
# Use `pkglite unpack` to restore

Package: demo-py-esub
File: pyproject.toml
Format: text
Content:
  [project]
  name = "demo001"
  version = "0.1.0"
  ...

Reviewers can read without special tools.
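To illustrate how readable the format is, here is a minimal field parser for DCF-style records (our own sketch for illustration, not the official py-pkglite unpacker):

```python
def parse_dcf(text: str) -> list[dict]:
    """Parse DCF-style records: 'Field: value' lines, indented
    continuation lines, records separated by blank lines."""
    records, current, field = [], {}, None
    for line in text.splitlines():
        if line.startswith("#"):       # skip comment lines
            continue
        if not line.strip():           # blank line ends a record
            if current:
                records.append(current)
                current, field = {}, None
            continue
        if line.startswith((" ", "\t")) and field:
            current[field] += "\n" + line.strip()  # continuation
        else:
            field, _, value = line.partition(":")
            current[field.strip()] = value.strip()
    if current:
        records.append(current)
    return records

sample = """Package: demo-py-esub
File: pyproject.toml
Format: text
Content:
  [project]
  name = "demo001"
"""
rec = parse_dcf(sample)[0]
print(rec["Package"], rec["File"])
```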

Updating ADRG

Document the Python environment:

Python environment:

Software   Version   Description
Python     3.13.9    Programming language
uv         0.9.7     Package manager

Packages:

Package   Version   Description
polars    1.35.1    Data manipulation
rtflite   1.0.2     RTF generation
demo001   0.1.0     Study functions

Appendix: Step-by-step reproduction instructions.

Dry run testing

Essential: Simulate reviewer experience before submission.

Workflow:

  1. Create clean directory
  2. Copy submission materials
  3. Unpack package: uvx pkglite unpack programs/py0pkgs.txt -o .
  4. Install dependencies: cd demo-py-esub && uv sync
  5. Run programs: python ../programs/tlf-*.txt
  6. Verify outputs match originals

Catches: Missing dependencies, path errors, platform issues.

See: https://pycsr.org/submission-dryrun.html

Q&A

Resources

Book:

Regulatory:

Technical: